ABSTRACT



Detecting harmful or illegal intrusions into a computer 
network or into restricted portions of a computer network 
uses statistical analysis to match user commands and pro- 
gram names with a template sequence. Discrete correlation 
matching and permutation matching are used to match 
sequences. The result of the match is input to a feature 
builder and then a modeler to produce a score. The score 
indicates possible intrusion. A sequence of user commands 
and program names and a template sequence of known 
harmful commands and program names from a set of such 
templates are retrieved. A closeness factor indicative of the 
similarity between the user command sequence and a tem- 
plate sequence is derived from comparing the two 
sequences. The user command sequence is compared to each 
template sequence in the set of templates thereby creating 
multiple closeness or similarity measurements. These mea- 
surements are examined to determine which sequence tem- 
plate is most similar to the user command sequence. A 
frequency feature associated with the user command 
sequence and the most similar template sequence is calcu- 
lated. It is determined whether the user command sequence 
is a potential intrusion into restricted portions of the com- 
puter network by examining output from a modeler using the 
frequency feature as one input. 

30 Claims, 7 Drawing Sheets 
COMPUTER NETWORK INTRUSION
DETECTION 

BACKGROUND OF THE INVENTION 

1. Field of the Invention

The present invention relates generally to the field of
computer systems software and computer network security.
More specifically, it relates to software for detecting intru- 
sions and security violations in a computer system using 
statistical pattern analysis techniques.

2. Discussion of Related Art 

Computer network security has been an important issue 
for all types of organizations and corporations for many 
years. Computer break-ins and misuse have become
common. The number, as well as sophistication, of
attacks on computer systems is on the rise. Often, network 
intruders have easily overcome the password authentication 
mechanism designed to protect the system. With an 
increased understanding of how systems work, intruders
have become skilled at determining their weaknesses and 
exploiting them to obtain unauthorized privileges. Intruders 
also use patterns of intrusion that are often difficult to trace 
and identify. They use several levels of indirection before 
breaking into target systems and rarely indulge in sudden
bursts of suspicious or anomalous activity. If an account on
a target system is compromised, intruders may carefully
cover their tracks so as not to arouse suspicion. Furthermore,
threats like viruses and worms do not need human supervision
and are capable of replicating and traveling to connected
computer systems. Once they are unleashed at one computer, by
the time they are discovered it is almost impossible to trace
their origin or the extent of infection.

As the number of users within a particular entity grows, 
the risks from unauthorized intrusions into computer systems
or into certain sensitive components of a large computer
system increase. In order to maintain a reliable and
secure computer network, regardless of network size, exposure
to potential network intrusions must be reduced as
much as possible. Network intrusions can originate from
legitimate users within an entity attempting to access secure
portions of the network or can originate from "hackers" or
illegitimate users outside an entity attempting to break into
the entity's network. Intrusions from either of these two
groups of users can be damaging to an organization's
computer network.

One approach to detecting computer network intrusions is 
analyzing command sequences input by users or intruders in 
a computer system. The goal is to determine when a possible 
intrusion is occurring and who the intruder is. This approach
is referred to broadly as intrusion detection using pattern 
matching. Sequences of commands (typically operating sys- 
tem or non-application specific commands) and program or 
file names entered by each user are compared to anomalous 
command patterns derived through historical and other
empirical data. By performing this matching or comparing, 
security programs can generally detect anomalous command 
sequences that can lead to detection of a possible intrusion. 

FIG. 1 is a block diagram of a security system of a 
computer network as is presently known in the art. A
network security system 10 is shown having four general 
components: an input sequence 12; a set of templates of 
suspect command sequences 14; a match component 16; and 
an output score 18. Input sequence 12 is a list of commands 
and program names entered in a computer system (not
shown) in a particular order over a specific duration of time. 
The commands entered by a user that are typically external 
to a specific user application (e.g., a word processing program
or database program) can be broadly classified as
operating system level commands. The duration of time 
during which an input sequence is monitored can vary 
widely depending on the size of the network and the volume 
of traffic. Typical durations can be from 15 minutes to eight
hours. 

Template set 14 is a group of particular command 
sequences determined to be anomalous or suspicious for the 
given computer system. These suspect command sequences 
are typically determined empirically by network security 
specialists for the particular computer network within an 
organization or company. They are sequences of commands 
and program names that have proved in the past to be 
harmful to the network or are in some way indicative of a 
potential network intrusion. Thus, each command sequence 
is a template for an anomalous or harmful command 
sequence. Input sequence 12 and a command sequence 
template from template set 14 are routed to match compo- 
nent 16. 

Component 16 typically uses some type of metric, for 
example a neural network, to perform a comparison between 
the input sequence and the next selected command sequence 
template. Once the match is performed between the two 
sequences, score 18 is output reflecting the closeness of the 
input sequence to the selected command sequence template. 
For example, a low score could indicate that the input 
sequence is not close to the template and a high score could 
indicate that the two are very similar or close. Thus, by 
examining score 18, computer security system 10 can deter- 
mine whether an input sequence from a network user or 
hacker is a potential intrusion because the input sequence 
closely resembles a known anomalous command sequence. 

Many computer network security systems presently in use 
and as shown in FIG. 1 have some significant drawbacks. 
One is often an overly complicated and inefficient matching 
metric or technique used to compare the two command 
sequences. The definition of "closeness" with these metrics 
is typically complicated and difficult to implement. Another 
drawback is also related to the matching metric used in 
matching component 16. Typically, matching metrics pres- 
ently employed for intrusion detection in network security 
systems end their analysis after focusing only on the com- 
mand sequences themselves. They do not take into account 
other information that may be available to define the close- 
ness or similarity of the command sequences, which might 
lead to a better analysis. 

Tools are therefore necessary to monitor systems, to 
detect break-ins, and to respond actively to the attack in real 
time. Most break-ins prevalent today exploit well-known
security holes in system software. One solution to these
problems is to study the characteristics of intrusions, to
extrapolate from these the intrusion characteristics of the
future, and to devise means of representing intrusions in a
computer so that break-ins can be detected in real time.

Therefore, it would be desirable to use command 
sequence pattern matching for detecting network intrusion 
that has matching metrics that are efficient and simple to 
maintain and understand. It would be desirable if such 
matching metrics took advantage of relevant and useful 
information external to the immediate command sequence 
being analyzed, such as statistical data illustrative of the 
relationship between the command sequence and other users 
on the network. It would also be beneficial if such metrics 
provided a definition of closeness between two command 
sequences that is easy to interpret and manipulate by a 
network intrusion program. 
SUMMARY OF THE INVENTION

To achieve the foregoing, methods, apparatus, and 
computer-readable medium are disclosed which provide 
computer network intrusion detection. In one aspect of the 
invention, a method of detecting an intrusion in a computer 
network is disclosed. A sequence of user commands and 
program names and a template sequence of known harmful 
commands and program names from a set of such templates 
are retrieved. A closeness factor indicative of the similarity 
between the user command sequence and the template 
sequence is derived from comparing the two sequences. The 
user command sequence is compared to each template 
sequence in the set of templates thereby creating multiple 
closeness factors. The closeness factors are examined to 
determine which sequence template is most similar to the 
user command sequence. A frequency feature associated 
with the user command sequence and the most similar 
template sequence is calculated. It is then determined 
whether the user command sequence is a potential intrusion 
into restricted portions of the computer network by exam- 
ining output from a modeler using the frequency feature as 
one input. Advantageously, network intrusions can be 
detected using matching metrics that are efficient and simple 
to maintain and understand. 
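
The selection step in this aspect (compare the user sequence with every template, keep each closeness factor, then pick the most similar template) can be sketched as follows. This is only an illustrative sketch: the function names, the dictionary layout of the template set, and the placeholder `overlap_closeness` metric are assumptions made for the example and are not taken from the specification.

```python
def most_similar_template(user_seq, templates, closeness):
    """Return (best_template_id, scores) for a user command sequence.

    templates: dict mapping a template identifier to a list of commands
               (hypothetical layout, chosen for this example).
    closeness: any function (user_seq, template) -> number, where a
               larger value means the two sequences are more similar.
    """
    if not templates:
        raise ValueError("template set is empty")
    # One closeness factor per template, stored for later examination.
    scores = {tid: closeness(user_seq, tmpl) for tid, tmpl in templates.items()}
    # The most similar template is the one with the highest factor.
    best_id = max(scores, key=scores.get)
    return best_id, scores


def overlap_closeness(user_seq, template):
    """Placeholder metric (an assumption, not one of the disclosed
    matching techniques): fraction of template commands that also
    appear somewhere in the user sequence."""
    if not template:
        return 0.0
    present = sum(1 for cmd in template if cmd in user_seq)
    return present / len(template)
```

Any matching metric with the same call shape could be passed in place of the placeholder.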

In one embodiment, the user command sequence is 
obtained by chronologically logging commands and pro- 
gram names entered in the computer network thereby cre- 
ating a command log, and then arranging the command log 
according to individual users on the computer network. The 
user command sequence is identified from the command log 
using a predetermined time period. In another embodiment, 
the frequency of the user command sequence occurring in a 
command stream created by a network user from a general 
population of network users is determined. Another fre- 
quency value of how often the most similar sequence 
template occurs in a command stream created by all network 
users in the general population of network users is deter- 
mined. The two frequency values are used to calculate a 
frequency feature. 
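
A minimal sketch of the logging embodiment, assuming each log entry is a `(timestamp, user, command)` tuple with numeric timestamps in seconds. The tuple layout and the function names are hypothetical and serve only to illustrate arranging a chronological log by user and then cutting out a sequence for a predetermined time period.

```python
from collections import defaultdict


def split_log_by_user(log):
    """Arrange a command log according to individual users.

    log: iterable of (timestamp, user, command) tuples (assumed layout).
    Returns a dict: user -> list of (timestamp, command) in time order.
    """
    per_user = defaultdict(list)
    # Sort by timestamp only, so chronological order is kept per user.
    for ts, user, cmd in sorted(log, key=lambda entry: entry[0]):
        per_user[user].append((ts, cmd))
    return dict(per_user)


def extract_sequence(user_log, start, duration):
    """Identify the user command sequence inside one time window.

    Returns the commands whose timestamps fall in [start, start + duration).
    """
    return [cmd for ts, cmd in user_log if start <= ts < start + duration]
```

For example, a 30-minute sequence for one user would be `extract_sequence(split_log_by_user(log)["alice"], start, 30 * 60)`, where `"alice"` is a made-up user name.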

In another aspect of the present invention, a method of 
matching two command sequences in a network intrusion 
detection system is described. A user sequence having 
multiple user commands is retrieved, along with a template 
sequence having multiple template commands. The shorter 
of the two sequences is transformed to match the length of 
the longer sequence using unique, reserved characters. A 
similarity factor is derived from the number of matches 
between the user commands and the template commands by 
performing a series of comparisons between the user 
sequence and the template sequence. Similarity factors 
between the user sequence and each one of the template 
sequences are stored. The similarity between the user 
sequence and each one of the template sequences is deter- 
mined by examining the similarity factors, thereby reducing 
the complexity of the matching component of the computer 
network intrusion system. Advantageously, this method performs
better than the prior art because it is less complex and easier
to maintain. In one embodiment, the similarity factor is
derived by shifting either the user commands in the user 
sequence or the template commands in the template 
sequence before performing each comparison. 
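
The shift-and-compare matching described in this aspect might look like the sketch below. Several choices here are assumptions rather than details from the text: the shift is circular, the reserved padding characters are modeled as unique Python objects that can never equal a command, and the similarity factor is taken as the best match count over all shifts divided by the padded length.

```python
def correlation_similarity(user_seq, template_seq):
    """Similarity in [0, 1] between two command sequences.

    Pads the shorter sequence, then compares the two sequences position
    by position for every circular shift of the template.
    """
    user = list(user_seq)
    template = list(template_seq)
    n = max(len(user), len(template))
    if n == 0:
        return 0.0
    # Reserved padding: each object() is unique and matches nothing.
    user += [object() for _ in range(n - len(user))]
    template += [object() for _ in range(n - len(template))]

    best = 0
    for shift in range(n):
        shifted = template[shift:] + template[:shift]  # circular shift
        matches = sum(1 for u, t in zip(user, shifted) if u == t)
        best = max(best, matches)
    # Assumption: normalize the best match count by the padded length.
    return best / n
```

Summing the match counts over all shifts, instead of taking the maximum, would be an equally plausible reading of "a series of comparisons".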

In another aspect of the invention, another method of 
matching two command sequences in a network intrusion 
detection system is described. A user sequence having 
multiple user commands is retrieved, along with a template 
sequence having multiple template commands. A user substring
and a template substring are created. The user substring
has user commands found in the template sequence
and the template substring has stored commands found in
the user sequence. The number of alterations needed to
reorder either the user substring or the template substring to
have the same order as one another is saved. The number of
alterations needed to make the two substrings the same is
indicative of the similarity between the user sequence and
each one of the template sequences from the set of template
sequences.

In one embodiment, an alteration is an inversion in which
adjacent user commands or template commands are inverted
until the order of commands in the two substrings is the
same. In another embodiment, the number of alterations is
normalized by dividing the number of alterations by the
number of alterations that would be needed to make the two
substrings the same if the commands in the substrings were
in complete opposite order.
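
The inversion-based embodiment can be sketched as below, under two stated assumptions: each command is counted once (only its first occurrence is kept), and a result of `None` is returned when fewer than two commands are shared, since no ordering can then be compared. The sketch relies on the standard fact that the minimum number of adjacent swaps needed to reorder one list into another equals the number of out-of-order pairs, and that a fully reversed list of n items has n(n-1)/2 such pairs.

```python
def permutation_distance(user_seq, template_seq):
    """Normalized inversion count between the shared commands.

    0.0 means the shared commands appear in the same order;
    1.0 means they appear in completely opposite order.
    """
    def first_occurrences(seq, allowed):
        # Keep commands that also occur in the other sequence, once each.
        seen, kept = set(), []
        for cmd in seq:
            if cmd in allowed and cmd not in seen:
                seen.add(cmd)
                kept.append(cmd)
        return kept

    user_sub = first_occurrences(user_seq, set(template_seq))
    template_sub = first_occurrences(template_seq, set(user_seq))
    n = len(user_sub)
    if n < 2:
        return None  # assumption: ordering is undefined for 0 or 1 items

    rank = {cmd: i for i, cmd in enumerate(template_sub)}
    positions = [rank[cmd] for cmd in user_sub]
    # Count pairs that are out of order relative to the template.
    inversions = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if positions[i] > positions[j]
    )
    return inversions / (n * (n - 1) / 2)
```

Because a lower value means greater similarity here, a caller that needs a closeness factor where larger is better could use one minus this value.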

In another aspect of the invention, a system for detecting
an intrusion in a computer network is described. An input
sequence extractor retrieves a user input sequence, and a
sequence template extractor retrieves a sequence template
from a template set. A match component compares the user
input sequence and the sequence template to derive a
closeness factor. The closeness factor indicates a degree of
similarity between the user input sequence and the sequence
template. A features builder calculates a frequency feature
associated with the user input sequence and a sequence
template most similar to the user input sequence. A modeler
uses the frequency feature as one input, and output from the
modeler can be examined to determine whether the user
input sequence is a potential intrusion.

In one embodiment of the invention, the user input
extractor has a command log containing commands and
program names entered in the computer network and
arranged chronologically and according to individual users
on the computer network. The user input extractor also
contains a sequence identifier that identifies the user input
sequence from the command log using a given time period.
In another embodiment of the invention, the sequence template
extractor also has a command log that contains, in a
chronological manner, commands and program names
entered in the computer network. The extractor also has a
command sequence identifier for identifying a command
sequence determined to be suspicious from the command
log, and a sequence template extractor that creates the
sequence template from the command sequence. In yet
another embodiment, the match component has a permutation
matching component that compares the user input
sequence and a sequence template from the sequence template
set. In yet another embodiment, the match component
has a correlation matching component that compares the
user input sequence and a sequence template from
the sequence template set.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention may be best understood by reference to the 
following description taken in conjunction with the accom- 
panying drawings in which: 

FIG. 1 is a block diagram of a security system of a
computer network as is presently known in the art. 

FIG. 2 is a block diagram of a computer network security 
system in accordance with the described embodiment of the 
present invention. 

FIG. 3 is a flow diagram showing the extraction of an
input sequence in accordance with the described embodiment.

FIG. 4A is a flow diagram showing a process for creating
and installing templates of anomalous command sequences
in accordance with the described embodiment.

FIG. 4B is a block diagram showing a series of command
sequence templates and the format of a typical command
sequence.

FIG. 5 is a flow diagram of a process for matching or
comparing command sequences in accordance with the
described embodiment of the present invention.

FIG. 6 is a flow diagram of a process for building features
for use in a network intrusion program in accordance with
the described embodiment of the present invention.

FIG. 7 is a flow diagram of a modeling process in
accordance with the described embodiment of the present
invention.

FIG. 8 is a flow diagram of a discrete correlation matching
process that can be used in step 506 of FIG. 5 in accordance
with the described embodiment of the present invention.

FIG. 9 is a flow diagram of a permutation matching
process in accordance with the described embodiment of the
present invention.

FIG. 10 is a block diagram of a typical computer system
suitable for implementing an embodiment of the present
invention.

DETAILED DESCRIPTION

Reference will now be made in detail to a preferred
embodiment of the invention. An example of the preferred
embodiment is illustrated in the accompanying drawings.
While the invention will be described in conjunction with a
preferred embodiment, it will be understood that it is not
intended to limit the invention to one preferred embodiment.
To the contrary, it is intended to cover alternatives,
modifications, and equivalents as may be included within
the spirit and scope of the invention as defined by the
appended claims.

A method and system for comparing command sequences
for use in a computer network intrusion detection program
is described in the various figures. The metrics or methods
used in the present invention for comparing command
sequences utilize statistical data relating to a command
sequence and computer network users to provide better
analysis. In addition, the metrics are significantly less complex
and more efficient than other metrics used for the same
purpose, namely computer network security.

FIG. 2 is a block diagram of a computer network security
system 100 in accordance with the described embodiment of
the present invention. Network security system 100 may
have components generally similar to those found in system
10 of FIG. 1. Similar components may include input
sequence 12 and template set 14. Network security system
100 also has a match component 20, a features builder 22,
and a modeler 24. In the described embodiment of the
present invention, match component 20 uses a matching
technique such as the two matching techniques described in
greater detail in FIGS. 5, 8, and 9. These matching techniques
compare a command sequence segment received
from a user over a specific duration of time to a template
command sequence from template set 14. The generation
and format of command sequence templates is described in
greater detail in FIGS. 4A and 4B. In a specific embodiment
of the present invention, features builder 22 takes as input a
particular command sequence template determined to be
most similar to a user input sequence and calculates a
frequency feature, among other features, for use by modeler
24. A description of features, as the term is used in the
present invention, and a method of building such features are
provided in FIG. 6. In one embodiment of the present
invention, the computer system of FIG. 10 may be used to
implement any or all of the processes of FIGS. 3, 4A, and
5 through 9.

FIG. 3 is a flow diagram showing the extraction of an
input sequence in accordance with one embodiment of the
present invention. FIG. 3 depicts one technique for obtaining
input sequence 12 from the commands and program names
entered by all network users. At step 302 a designated
component of a computer network (such as a network server
or a specially designated client computer) sequentially logs
commands and program names entered by users logged onto
the network. In a typical computer network, commands
entered by users are logged chronologically. As they are
entered into the log, each command is time stamped and can
be identified by user names or identifiers. In another preferred
embodiment, commands and program names can be
logged based on criteria other than time. For example,
commands can be arranged in the log according to user name
or identifier or according to classes or types of commands.
Examples of commands logged by the network are "DIR",
"COPY", "DELETE", or "PRINT," etc. Examples of program
names are Matlab, SAS, and MATHEMATICA.

At step 304 data in the log file is parsed according to user
name or identifier while the time sequence is maintained.
Thus, in the described embodiment, individual sub-logs are
created for each user, each of which contains all the commands
and program names entered by the user in the same chronological
order in which they were logged at step 302. In another
preferred embodiment, this parsing may not be necessary if,
for example, the commands and program names were initially
logged and sorted. At step 306 a particular command
sequence is identified for one or more users. The sequence
length, which can range typically from 15 minutes to eight
hours (or any other suitable time length), depends on specific
characteristics of the network, such as the number and type
of users, traffic volume, and other security considerations.
Thus, in the described embodiment, steps 302 to 306 may be
run as often as necessary to obtain meaningful and practical
command sequences from users. It can also run for one user
if that user is singled out for any reason or for all network
users to maintain a constant watch over network activity.

FIG. 4A is a flow diagram showing a process for creating
and installing templates of anomalous command sequences
in accordance with one embodiment of the present invention.
As used herein, command sequence includes both
commands and program names. It shows in greater detail a
process for providing template set 14 of FIGS. 1 and 2. As
described above, template set 14 is a set of known command
sequences that are determined to be anomalous or in some
manner suspect. In another preferred embodiment, the command
sequences in the set of templates 14 can also be
indicative of other actions by users that for some reason are
of interest to a network administrator. That is, the sequences
can be normal or non-suspect, but still be of interest to a
network security system.

At step 402 the system logs command and program names
entered by users. In the described embodiment, step 402
may be performed as in step 302. Typically, the commands
are logged over a long period, such as months or years. At
step 404, security system experts and network security
designers identify suspicious or anomalous command
sequences. These identified command sequences are the
command sequence templates in component 14. The
sequences can be derived using historical or empirical data
observed and collected by security experts for a particular
computer network as is known in the art. The length of the
suspicious command sequences can vary. Thus, set of templates
14 can contain templates of varying length. At step
406 the identified sequences are installed as templates in
component 14 as part of the network's intrusion detection
system 100. Through the process of FIG. 4A, a set of
templates is created and can be added to whenever a newly
identified suspicious command sequence is discovered. The
process of generating templates of command sequences is
then complete.

Related to FIG. 4A, FIG. 4B shows a series of command
sequence templates and the format of a typical command
sequence. A set of templates 408 includes multiple numbered
templates, an example of which is shown at row 410.
Following the template number or other unique identifier
412, is a sequence of commands and program names 414. In
this example, each of the letters k, l, m, and n represents a
particular command or program name. In the described
embodiment, the order of the commands as they appear in
the template reflects the order in which they should be
entered by a user to be considered close or similar. A user
can enter other commands between the commands shown in
the template without negating the closeness of the user's
input sequence to a particular command sequence template.
The number of templates in a set of templates depends on
network specific factors and the level of intrusion detection
(or other type of detection) desired by the network administrator.
In another preferred embodiment, the format of
template set 408 can emphasize other criteria. For example,
the more important templates or most commonly detected
sequences can appear first or "higher up" in the template set,
or the templates can be sorted based on the length of the
command sequence template.

FIG. 5 is a flow diagram of a process for matching or
comparing command sequences in accordance with the
described embodiment of the present invention. It depicts a
process implemented by match component 20 in FIG. 2. At
step 502 the network security program retrieves an input
command sequence input by a user in close to real time.
Such a command sequence may be created as shown in FIG.
3. For the purpose of illustrating the described embodiment,
an input command sequence of any particular length is
represented by the variable X and is comprised of discrete
commands or program names. As described above, the
length of X depends on network specifications and security
needs and thus can vary.

At step 504 the network security program retrieves one
command sequence template, represented by Y, from template
set 14. As described in FIGS. 4A and 4B, Y can be a
sequence of commands that has been determined to be
suspicious or anomalous and is comprised of discrete commands
and program names. The length of Y also varies and
does not have to be the same as X. In the described
embodiment, the first time a sequence X is analyzed, the
program selects the first template from template set 14, and
retrieves the next template in the second iteration, as
described below in step 510. In another preferred
embodiment, the order of template Y selection can be based
on other criteria such as frequency, importance, or length.

At step 506 input sequence X and template Y are matched
or compared. Any suitable matching metric may be used. In
the described embodiment, the network security program
uses discrete correlation or permutation matching as matching
metrics described in FIGS. 8 and 9 respectively. Other
related techniques that use statistical analysis are Bayesian
networks and abnormality statistics. Both metrics output a
"closeness factor" or other variable indicating how similar
input sequence X is to the currently selected template Y.
This closeness factor or score from the match is stored in
memory by the security program at step 508. The program
then checks whether there are any more templates in template
set 14 at step 510. If there are, the program retrieves the
next template at step 504, and the match and store steps 506
and 508 are repeated. If there are no more templates, at step
512 the network security program determines which template
Y' is most similar to input sequence X by examining
the closeness factors or scores stored at step 508. At this
point the matching process is complete. In the described
embodiment, the matching process of FIG. 5 can be done
multiple times for the same user or for multiple users
throughout a predetermined time period, such as a day.

FIG. 6 is a flow diagram of a process for building features
for use in a network intrusion program in accordance with
the described embodiment of the present invention. FIG. 6
depicts a process implemented by feature builder 22 of FIG.
2. The input needed to begin building features, for example
a frequency feature, is the template Y' determined to be the
most similar to the input sequence X being analyzed as
shown in FIG. 5. Template Y' is retrieved at step 602.
Examples of other features are the number of audit records
processed for a user in one minute; the relative distribution
of file accesses, input/output activity over the entire system
usage; and the amount of CPU and input/output activity
from a user.

At step 604 the program determines a frequency, f(Y'), of
user input sequence X within a predetermined time period T.
The program calculates how often the input sequence X,
whose closest template is Y', appears amongst all sequences
entered by the same user during time period T. Preferably,
the duration of time period T used in this step is greater than
the sequence length of the input sequence from step 306.
Thus, if the user input sequence contains commands and
program names entered by a user over 30 minutes, time
period T is preferably longer than 30 minutes.

At step 606 the program determines how often Y' appears
amongst input sequences from the general population, such
as all users on the network or some subset of those users.
How often Y' occurs amongst input sequences from the
general population can be determined by keeping a real-time
log or record of all sequences of activity by all users in one
long string without regard to what activity was performed by
which users. It can then be determined how often Y' appears
in that long string or real-time log. In another embodiment,
an average of all the individual averages of each user is
taken to determine how often Y' occurs amongst input
sequences from the general population. Each user has an
average number of times Y' is entered by the user in
time period T. At step 606, this frequency of occurrence,
F(Y'), is determined over time period T. This frequency is
used to convey how often other users of the network have
entered input sequences that are the same as or highly
similar to Y'.

At step 608, a frequency feature is calculated for the user
based on f(Y') and F(Y'). In the described embodiment, the
frequency feature can be in the form of a real number and is,
in one respect, a running average. This allows the program
to calculate an average occurrence or frequency level
dynamically without having to store and retrieve from
memory the multiple values that would be needed to calculate
a static average. A frequency feature is calculated using
the variable F(Y') described above in addition to three other
variables: F'_new(Y'), F'_old(Y'), and a. In the described
embodiment, before the frequency feature is determined,
F'_new(Y') is calculated using the following equation:

$$F'_{new}(Y') = (1 - a)\,F(Y') + a\,F'_{old}(Y')$$

In the described embodiment, when a frequency feature is
calculated, F'_new(Y') represents the average number of times
the template with the highest closeness factor, Y', occurs in
the general population of network users. The general population
of network users includes all other users on the
network. In another preferred embodiment, the general
population can represent a specific subset of all network
users. F'_new(Y') is then used in a second equation along with
F(Y') to calculate a frequency feature for a particular
template, which in the described embodiment is the template
with the highest closeness factor for a particular user.

With respect to the first equation, two new variables, a and
F'_old(Y'), are used along with F(Y') to calculate F'_new(Y').
F'_old(Y') is the value from the preceding time period T. It is
essentially the F'_new(Y') from the previous calculation using
the equation. Since there is no F'_old(Y') the first time the
equation is used for a particular template and user, F(Y') is
used for F'_old(Y') in the first pass. Thereafter, each F'_new(Y')
calculated becomes the F'_old(Y') in the next calculation. The
variable a can be referred to as a "smoothing" factor that
allows the security program to assign different weights to
F'_old(Y') and F(Y'). In the described embodiment, F(Y')
represents the most recent average number of times the
general population uses Y', and is given greater weight than
the previous average, represented by F'_old(Y'). Thus, a may
be approximately 0.1 or so, giving F(Y') approximately 90%
of the weight (since it is being multiplied by 1 - a) and
giving F'_old(Y') approximately 10% of the weight in comparison.
In the described embodiment, a may be between 0 and 1 and
is preferably adjusted based on empirically derived data to
fine tune the network intrusion detection system. This operation
is generally referred to as smoothing. In another preferred
embodiment, other mechanisms or adjustments can be
used to fine tune the program by factoring in empirically
derived information.

Once an F'_new(Y') is calculated using the first equation, the
frequency feature can be calculated using it, its standard
deviation, and F(Y'). In the described embodiment, the
following equation is used:

$$\text{frequency feature} = \frac{F(Y') - F'_{new}(Y')}{\mathrm{STD}\big(F'_{new}(Y')\big)}$$

A standard deviation of the most recent average number of 
times Y' has been used by the general population is used to 
divide the difference between F(Y') and F^jfY'). In another 50 
preferred embodiment, a running standard deviation can be 
used in place of the static standard deviation used in the 
equation above using techniques well known in the art. In 
the described embodiment, the frequency feature calculated 
is in one respect a running average of the number of times 55 
the template Y' occurs amongst all users in the general 
population. 
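
The smoothing and normalization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function and parameter names are hypothetical, the first term of the numerator is taken to be the per-user frequency (an interpretation, since the text does not clearly distinguish the two frequencies), and the standard deviation is passed in by the caller because the text does not say how it is obtained.

```python
def smooth_population_frequency(current, previous=None, a=0.1):
    """Weighted running average: (1 - a) * current + a * previous.

    `current` plays the role of F(Y'), `previous` the role of F'_old(Y').
    On the first pass there is no previous value, so `current` is used.
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("smoothing factor a must be between 0 and 1")
    if previous is None:
        previous = current
    return (1.0 - a) * current + a * previous


def frequency_feature(user_freq, smoothed, std):
    """Deviation of a frequency from the smoothed average, in units of std.

    Returning 0.0 when std is zero is an assumption made for this sketch.
    """
    if std == 0:
        return 0.0
    return (user_freq - smoothed) / std


# Two consecutive time periods for one template.
first = smooth_population_frequency(20.0)                 # first pass
second = smooth_population_frequency(10.0, previous=first, a=0.1)
feature = frequency_feature(14.0, second, std=2.0)
```

Only the previous smoothed value needs to be kept between time periods, which is the storage saving the running average is meant to provide.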

Once the frequency feature is calculated at step 608, the
program can calculate any other features for the security
program at step 610. In the described embodiment, the
frequency feature is one example of a feature that can be
used in the present invention. In another preferred
embodiment, the network intrusion program can be entirely
frequency-based in which no other features are used to
create input for the modeler described below. For example,
in an all frequency-based program, the templates with the
top five scores or closeness factors could be chosen instead
of just the template with the highest score. Frequencies can
be calculated for each one of the templates and used as the
only input to the modeler. In either case, the feature building
process is complete after step 610.
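
As a brief illustration of the all frequency-based variant, the following Python sketch picks the templates with the top five closeness factors. The dictionary layout and the name `top_templates` are assumptions for illustration, and a higher closeness factor is assumed to mean a closer match, as with discrete correlation.

```python
import heapq


def top_templates(closeness_by_template, k=5):
    """Return the k template names with the highest closeness factors."""
    return heapq.nlargest(k, closeness_by_template,
                          key=closeness_by_template.get)


# Hypothetical closeness factors for three templates.
scores = {"A": 0.5, "B": 1.2, "C": 0.1}
best_two = top_templates(scores, k=2)
```

A frequency feature would then be computed for each returned template and the resulting values passed together to the modeler.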

FIG. 7 is a flow diagram of a modeling process in 
accordance with the described embodiment of the present 
invention. It shows in greater detail a process implemented 
by modeler 24 of FIG. 2. As is well known in the field of 
computer science, there are many different modeling pro- 
cesses to choose from, such as linear regression or neural 
networks, to name just two. The actual modeling step is 
described below in step 706. Although any one of the 
numerous well known modeling techniques (e.g. Markov 
models, graphical models, regression models) can be used, 
whichever technique is used, the specific model used will be 
trained to evaluate the features to recognize the possibility of 
a network intrusion. This training can be done by providing
the model with frequency features, as defined above, that are
based on command and program name sequences known from
previous experience to result in actual network intrusions.
Other inputs to the modeling process include failed log-in
time intervals (i.e., the time between failed log-in attempts)
and other non-command related features. By training the
modeler to recognize these sequences and the associated 
frequency features, it can recognize which sequences are 
intrusions and which are not. 

At step 702, the modeling process accepts as input the 
frequency feature calculated in FIG. 6 or other features. In 
the described embodiment, one frequency feature is used as 
input; however, in another preferred embodiment, multiple
frequency features or other types of features described above 
can be used as input. At step 704, the frequency feature is 
combined with any other features based on the specific 
modeler. As is well known in the art, a modeler can accept 
different types of features. Once the modeler receives all 
input features, it calculates a score in step 706, which is 
distinct from the closeness factor calculated at step 508. This 
score is based upon the input features and the training the 
model has received. 

At step 708 the security program uses the score derived in 
step 706 to determine whether an intrusion has occurred or 
has been attempted. The score derived by the modeler can be 
in the form of a real number or integer. In addition, it can be 
normalized to a number between 1 and 100 to facilitate 
evaluation by a human being. Once the modeler has deter- 
mined whether an intrusion has occurred based on the score, 
the process is complete. 
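
The flow of steps 702 through 708 can be sketched as follows. The text leaves the choice of model open, so a logistic regression-style scorer is assumed here purely for illustration; the weights, bias, and alert threshold are hypothetical values that a real system would obtain from training and tuning.

```python
import math


def model_score(features, weights, bias=0.0):
    """Combine input features into a score between 1 and 100.

    A weighted sum is squashed with a logistic function and then
    rescaled; this model is an assumption, not the patent's modeler.
    """
    z = bias + sum(w * f for w, f in zip(weights, features))
    z = max(-60.0, min(60.0, z))          # clamp to avoid overflow in exp
    probability = 1.0 / (1.0 + math.exp(-z))
    return 1.0 + 99.0 * probability


def is_potential_intrusion(score, threshold=80.0):
    """Flag a sequence when its score reaches a hypothetical threshold."""
    return score >= threshold


# One frequency feature plus one other feature, with made-up weights.
score = model_score([1.5, 0.2], weights=[2.0, 1.0], bias=-1.0)
flagged = is_potential_intrusion(score)
```

The score returned here is separate from the closeness factor of step 508: the closeness factor selects the template, while the score reflects the trained model's assessment of the resulting features.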

At step 506 of FIG. 5, a user input command sequence X 
and a sequence template Y are matched or compared. Two 
preferred techniques used to perform the match will now be 
described. FIG. 8 is a flow diagram of a "discrete correla- 
tion" matching process that can be used in step 506 of FIG. 
5 in accordance with the described embodiment of the 
present invention. Another method, shown in FIG. 9, is
referred to as "permutation matching."

At step 802 input sequence X, retrieved in step 502 and
representing a string of command sequences and program
names entered by a particular user in time period T, is
identified. At step 804 command sequence template Y,
retrieved in step 504, is identified. Since the two sequences
are not necessarily the same length, the shorter sequence is
transformed before the actual comparison operation. In the
described embodiment, this is done by padding the shorter
sequence after the far right character with a reserved unique
character which should not appear in either sequence (or in
any possible command sequence) until the two sequences
are the same length. This process is referred to as Padding
(X, Y). In another preferred embodiment, other well known
techniques can be used to transform either or both of the
sequences to make them suitable for comparison.

Once the sequences have the same length, the first stage 
of the matching can be performed at step 808. Initially, each
discrete command or program name in one sequence is 
compared to a corresponding command or program name in 
the other sequence. For example, if X=[a,b,c,d] and Y=[a, 
e,f,d], a is compared to a, b is compared to e, c is compared 
to f, and d is compared to d. The number of matches is 
recorded to derive a ratio. In this example, there are two 
matches: the a's and the d's, out of four comparisons. Thus, 
the ratio for this initial round of comparisons is 2/4, or 1/2. 

At step 810, in the described embodiment, one of the 
sequences (it does not matter which one) is shifted by one 
position to the left or the right. Once one of the sequences 
is shifted by one position, control returns to step 808 and the 
same comparison operation is performed. Following the 
same example, after shifting Y one position to the right (or 
shifting X one position to the left), the b in X is compared 
to the a in Y, the c is compared to e, and the d is compared 
to the f in Y. The first a in X and the last d in Y are not 
compared to any other commands. Out of the three com- 
parisons in the second round, there are no matches. Thus, the 
ratio from this round is 0/3, which is recorded by the security 
program. These ratios will be summed at step 812 described 
in more detail below. Control then returns to step 810 where 
one of the sequences is shifted again. For example, the Y 
sequence can be shifted once again to the right. The 
sequences are shifted until there are no more commands in 
either sequence to be compared. To complete the example, 
at step 808 the c in X is compared to the a in Y, and the d 
is compared to the e in Y. The ratio, in this case 0/2, is stored. 
The process is repeated one more time where d is compared 
to a providing the ratio 0/1. 

In another preferred embodiment, one of the sequences 
can be shifted by two or more positions instead of just one. 
This can be done if the sequences are lengthy or for 
efficiency, although at the expense of accuracy in deriving 
the closeness factor. In another preferred embodiment, a 
closeness factor can be derived after performing the com- 
parison with a one-position shift, as described above, and 
another closeness factor derived after performing the com- 
parison with a two-position shift, and so on. After building 
a series of closeness factors using different position shifts, 
the best closeness factor is chosen. This can be done in 
numerous ways: taking the maximum, taking an average, 
taking the summation of all closeness factors, or taking the 
summation of the square values of all closeness factors, 
among other techniques. 
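
The options listed for reducing a series of closeness factors to a single value can be sketched as below. The function name and the `how` keywords are hypothetical; the input is assumed to be a list of closeness factors already computed with different position shifts.

```python
def combine_closeness(factors, how="max"):
    """Reduce closeness factors from different shift sizes to one value."""
    if not factors:
        raise ValueError("at least one closeness factor is required")
    if how == "max":
        return max(factors)
    if how == "mean":
        return sum(factors) / len(factors)
    if how == "sum":
        return sum(factors)
    if how == "sum_squares":
        return sum(f * f for f in factors)
    raise ValueError("unknown combination method: " + how)


# Closeness factors from a one-position and a two-position shift.
combined = combine_closeness([0.5, 0.25], how="sum_squares")
```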

At step 812, the ratios are summed to provide a closeness 
factor described initially in step 508 of FIG. 5. In the 
example above, the ratios 2/4, 0/3, 0/2, and 0/1 are summed 
giving a closeness factor of 1/2. This is the closeness factor 
between input sequence X and template Y. At this stage, the 
process of matching using the discrete correlation approach 
is complete. 
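
The discrete correlation steps 802 through 812 can be sketched in Python as follows. This is a minimal sketch under assumptions: the names are hypothetical, shifting is done in one direction only as in the worked example, and padded positions are counted as comparisons that never match.

```python
PAD = object()  # reserved sentinel that cannot equal any real command


def pad_sequences(x, y):
    """Pad the shorter sequence on the right until lengths are equal."""
    n = max(len(x), len(y))
    return (list(x) + [PAD] * (n - len(x)),
            list(y) + [PAD] * (n - len(y)))


def discrete_correlation(x, y, step=1):
    """Sum of match ratios over successive shifts of one sequence."""
    x, y = pad_sequences(x, y)
    n = len(x)
    closeness = 0.0
    for shift in range(0, n, step):
        comparisons = n - shift            # overlapping positions
        matches = sum(1 for i in range(comparisons)
                      if x[i + shift] == y[i])
        closeness += matches / comparisons
    return closeness


# Worked example from the text: ratios 2/4, 0/3, 0/2 and 0/1.
closeness = discrete_correlation(["a", "b", "c", "d"],
                                 ["a", "e", "f", "d"])
```

Passing a `step` greater than one corresponds to shifting by two or more positions, trading accuracy for fewer comparisons.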

FIG. 9 is a flow diagram of a permutation matching
process in accordance with the described embodiment of the
present invention. It shows another preferred method of
matching two command sequences as initially shown in step
506 of FIG. 5. At step 902 input sequence X, retrieved at step
502, representing a string of command sequences and
program names entered by a particular user for a predetermined
duration of time is identified. At step 904 command
sequence template Y is identified. In permutation matching,
the X and Y sequences do not have to be the same length
before performing the comparison, in contrast to the discrete
correlation method.

At step 906 the security program determines the number
of common commands and program names in the sequences
X and Y. For example, if X=[a b c d e f] and Y=[d a b g c],
the common elements are a, b, c, and d. At step 908 the
common elements are extracted sequentially from each of
the two sequences thereby creating two sub-groups: string X
and string Y. Following the same example, string X=[a b c
d] and string Y=[d a b c]. In the described embodiment,
when there are duplicates of a common element in one of the
sequences, the first occurrence of the duplicate (or multiple
occurring) command or program name is extracted and
placed in the string for that sequence. In another
embodiment, the commands or program names are chosen
randomly as long as there is one of each of the elements that
are common. For example, if the number of common
elements is four, as above, but X had multiple a's and d's,
a random four elements are chosen from X until one of each
of the common elements is extracted.

At step 910 the security program determines the number
of adjacent-element inversions necessary to transform one of
the strings to be identical to the other. Using the strings from
above, the number of inversions necessary to transform
string Y to be in the same order as string X is three. For
example, where string X=[a b c d] and string Y=[d a b c],
string Y after one inversion is [a d b c], after two inversions
is [a b d c], and after three is [a b c d]. At step 912 the number
of inversions is normalized to allow for a logical and
accurate comparison of the number of inversions for all
templates Y. The inversion values have to be normalized because
the templates are not all the same length. Without
normalization, large differences in the number of inversions
among templates could arise solely from their length rather
than from their closeness to the input sequence X. In the
described embodiment, an inversion value is normalized by
dividing the value by the number of inversions that would be
needed to transform one sequence to another in the worst
possible scenario; that is, if string Y, for example, is in complete
reverse order from X. Following the same example, the total
number of inversions necessary to transform string Y to
string X if string Y is [d c b a] is six. Thus, the matching
factor or score derived in step 912 for this example is 3/6.
With permutation matching, the best matching factor is the
one having the lowest number.
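
Steps 906 through 912 can be sketched in Python as follows. The sketch relies on the fact that the minimum number of adjacent-element swaps needed to reorder one string into another equals the number of out-of-order pairs, and that a fully reversed string of n elements needs n(n-1)/2 swaps. The names are hypothetical, and returning None when fewer than two elements are common is an assumption, since the text does not cover that case.

```python
def first_common(seq, common):
    """Keep the first occurrence of each common element, in order."""
    seen, out = set(), []
    for item in seq:
        if item in common and item not in seen:
            seen.add(item)
            out.append(item)
    return out


def permutation_match(x, y):
    """Normalized inversion count between the common-element strings."""
    common = set(x) & set(y)
    string_x = first_common(x, common)
    string_y = first_common(y, common)
    n = len(string_x)
    if n < 2:
        return None                        # assumed: score undefined
    rank = {item: i for i, item in enumerate(string_x)}
    order = [rank[item] for item in string_y]
    inversions = sum(1 for i in range(n) for j in range(i + 1, n)
                     if order[i] > order[j])
    worst_case = n * (n - 1) // 2          # string_y fully reversed
    return inversions / worst_case


# Worked example from the text: 3 inversions out of a worst case of 6.
factor = permutation_match(list("abcdef"), list("dabgc"))
```

The quadratic pair count is adequate for short command strings; a merge-sort based count could replace it for long sequences.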

In the described embodiments, the closeness factor used
to determine template Y' is derived using either the discrete
correlation or the permutation matching method. In another
embodiment, a template Y' can be determined from
closeness factors derived from examining scores from a
combination of both techniques. The following scenario illustrates
this embodiment. Using discrete correlation, template A
receives a score ranking it the highest, or most similar, to
input sequence X and template B is ranked as the fifth most
similar. In a second round of matching, this time using
permutation matching, template A receives a score ranking
it as third most similar to the same input sequence X and
template B is ranked as first or the most similar to X. In this
scenario, the security program still chooses Y' by
determining the best score but examines how each of the highest
ranking templates A and B ranked using the other matching
technique. In this scenario, the program will choose template
A since it is ranked third using the other matching method
whereas template B ranked fifth using the other matching
technique. Thus, template A ranked higher overall compared
to template B. In another embodiment, the matching factor
determined at step 912 (where the smaller the number the
higher the similarity) is subtracted from one, and the security
program chooses Y' from the template with the highest score
derived from both matching techniques.

Computer System Embodiment

As described above, the present invention employs various
computer-implemented operations involving data stored
in computer systems. These operations include, but are not
limited to, those requiring physical manipulation of physical
quantities. Usually, though not necessarily, these quantities
take the form of electrical or magnetic signals capable of
being stored, transferred, combined, compared, and otherwise
manipulated. The operations described herein that form
part of the invention are useful machine operations. The
manipulations performed are often referred to in terms such
as producing, matching, identifying, running, determining,
comparing, executing, downloading, or detecting. It is
sometimes convenient, principally for reasons of common usage,
to refer to these electrical or magnetic signals as bits, values,
elements, variables, characters, data, or the like. It should be
remembered, however, that all of these and similar terms are
to be associated with the appropriate physical quantities and
are merely convenient labels applied to these quantities.

The present invention also relates to a computer device,
system or apparatus for performing the aforementioned
operations. The system may be specially constructed for the
required purposes, or it may be a general purpose computer,
such as a server computer or a mainframe computer,
selectively activated or configured by a computer program stored
in the computer. The processes presented above are not
inherently related to any particular computer or other
computing apparatus. In particular, various general purpose
computers may be used with programs written in accordance
with the teachings herein, or, alternatively, it may be more
convenient to construct a more specialized computer system
to perform the required operations.

FIG. 10 is a block diagram of a general purpose computer
system 1000 suitable for carrying out the processing in
accordance with one embodiment of the present invention.
FIG. 10 illustrates one embodiment of a general purpose
computer system that, as mentioned above, can be a server
computer, a client computer, or a mainframe computer.
Other computer system architectures and configurations can
be used for carrying out the processing of the present
invention. Computer system 1000, made up of various
subsystems described below, includes at least one
microprocessor subsystem (also referred to as a central processing
unit, or CPU) 1002. That is, CPU 1002 can be implemented
by a single-chip processor or by multiple processors. CPU
1002 is a general purpose digital processor which controls
the operation of the computer system 1000. Using
instructions retrieved from memory, the CPU 1002 controls the
reception and manipulation of input data, and the output and
display of data on output devices.

CPU 1002 is coupled bi-directionally with a first primary
storage 1004, typically a random access memory (RAM),
and uni-directionally with a second primary storage area
1006, typically a read-only memory (ROM), via a memory
bus 1008. As is well known in the art, primary storage 1004
can be used as a general storage area and as scratch-pad
memory, and can also be used to store input data and
processed data, such as command and program name
sequences. It can also store programming instructions and
data, in the form of a message store in addition to other data
and instructions for processes operating on CPU 1002, and
is typically used for fast transfer of data and instructions
in a bi-directional manner over the memory bus 1008.
Also as well known in the art, primary storage 1006
typically includes basic operating instructions, program code,
data, and objects used by the CPU 1002 to perform its
functions. Primary storage devices 1004 and 1006 may
include any suitable computer-readable storage media,
described below, depending on whether, for example, data
access needs to be bi-directional or uni-directional. CPU
1002 can also directly and very rapidly retrieve and store
frequently needed data in a cache memory 1010.

A removable mass storage device 1012 provides additional
data storage capacity for the computer system 1000,
and is coupled either bi-directionally or uni-directionally to
CPU 1002 via a peripheral bus 1014. For example, a specific
removable mass storage device commonly known as a
CD-ROM typically passes data uni-directionally to the CPU
1002, whereas a floppy disk can pass data bi-directionally to
the CPU 1002. Storage 1012 may also include
computer-readable media such as magnetic tape, flash memory, signals
embodied on a carrier wave, smart cards, portable mass
storage devices, holographic storage devices, and other
storage devices. A fixed mass storage 1016 also provides
additional data storage capacity and is coupled
bi-directionally to CPU 1002 via peripheral bus 1014. The
most common example of mass storage 1016 is a hard disk
drive. Generally, access to these media is slower than access
to primary storages 1004 and 1006. Mass storage 1012 and
1016 generally store additional programming instructions,
data, and the like that typically are not in active use by the
CPU 1002. It will be appreciated that the information
retained within mass storage 1012 and 1016 may be
incorporated, if needed, in standard fashion as part of
primary storage 1004 (e.g. RAM) as virtual memory.

In addition to providing CPU 1002 access to storage
subsystems, the peripheral bus 1014 is used to provide
access to other subsystems and devices as well. In the
described embodiment, these include a display monitor 1018
and adapter 1020, a printer device 1022, a network interface
1024, an auxiliary input/output device interface 1026, a
sound card 1028 and speakers 1030, and other subsystems as
needed.

The network interface 1024 allows CPU 1002 to be
coupled to another computer, computer network, including
the Internet or an intranet, or telecommunications network
using a network connection as shown. Through the network
interface 1024, it is contemplated that the CPU 1002 might
receive information, e.g., data objects or program
instructions, from another network, or might output
information to another network in the course of performing the
above-described method steps. Information, often
represented as a sequence of instructions to be executed on a
CPU, may be received from and outputted to another
network, for example, in the form of a computer data signal
embodied in a carrier wave. An interface card or similar
device and appropriate software implemented by CPU 1002
can be used to connect the computer system 1000 to an
external network and transfer data according to standard
protocols. That is, method embodiments of the present
invention may execute solely upon CPU 1002, or may be
performed across a network such as the Internet, intranet
networks, or local area networks, in conjunction with a
remote CPU that shares a portion of the processing.
Additional mass storage devices (not shown) may also be
connected to CPU 1002 through network interface 1024.

Auxiliary I/O device interface 1026 represents general
and customized interfaces that allow the CPU 1002 to send
and, more typically, receive data from other devices such as
microphones, touch-sensitive displays, transducer card
readers, tape readers, voice or handwriting recognizers,
biometrics readers, cameras, portable mass storage devices,
and other computers.

Also coupled to the CPU 1002 is a keyboard controller 
1032 via a local bus 1034 for receiving input from a 
keyboard 1036 or a pointer device 1038, and sending 
decoded symbols from the keyboard 1036 or pointer device 
1038 to the CPU 1002. The pointer device may be a mouse, 
stylus, track ball, or tablet, and is useful for interacting with 
a graphical user interface. 

In addition, embodiments of the present invention further
relate to computer storage products with a computer-readable
medium that contains program code for performing
various computer-implemented operations. The
computer-readable medium is any data storage device that can store
data that can thereafter be read by a computer system. The 
media and program code may be those specially designed 
and constructed for the purposes of the present invention, or 
they may be of the kind well known to those of ordinary skill 
in the computer software arts. Examples of computer- 
readable media include, but are not limited to, all the media 
mentioned above: magnetic media such as hard disks, floppy 
disks, and magnetic tape; optical media such as CD-ROM 
disks; magneto-optical media such as floptical disks; and 
specially configured hardware devices such as application- 
specific integrated circuits (ASICs), programmable logic 
devices (PLDs), and ROM and RAM devices. The 
computer-readable medium can also be distributed as a data 
signal embodied in a carrier wave over a network of coupled 
computer systems so that the computer-readable code is 
stored and executed in a distributed fashion. Examples of 
program code include both machine code, as produced, for 
example, by a compiler, or files containing higher level code 
that may be executed using an interpreter. 

It will be appreciated by those skilled in the art that the 
above described hardware and software elements are of 
standard design and construction. Other computer systems 
suitable for use with the invention may include additional or 
fewer subsystems. In addition, memory bus 1008, peripheral 
bus 1014, and local bus 1034 are illustrative of any inter- 
connection scheme serving to link the subsystems. For 
example, a local bus could be used to connect the CPU to 
fixed mass storage 1016 and display adapter 1020. The 
computer system shown in FIG. 10 is but an example of a 
computer system suitable for use with the invention. Other 
computer architectures having different configurations of 
subsystems may also be utilized. 

Although the foregoing invention has been described in 
some detail for purposes of clarity of understanding, it will 
be apparent that certain changes and modifications may be 
practiced within the scope of the appended claims. 
Furthermore, it should be noted that there are alternative 
ways of implementing both the process and apparatus of the 
present invention. For example, a single matching technique 
or a combination of both matching techniques can be used 
to determine the closest template. In another example, the 
discrete correlation matching can use varying gaps between 
matches depending on levels of accuracy and efficiency 
desired. In yet another example, in determining the fre- 
quency feature, a running average or a running standard 
deviation can be used in place of the averages shown in the 
described embodiment. Accordingly, the present embodi- 
ments are to be considered as illustrative and not restrictive, 
and the invention is not to be limited to the details given 
herein, but may be modified within the scope and
equivalents of the appended claims.

What is claimed is:

1. A method of detecting an intrusion in a computer 
network, the method comprising: 

(a) retrieving a user input sequence; 

(b) retrieving a sequence template from a plurality of 
sequence templates; 

(c) comparing the user input sequence and the sequence 
template to derive a closeness factor indicating a degree 
of similarity between the user input sequence and the 
sequence template; 

(d) calculating a frequency feature associated with the 
user input sequence and a most similar sequence tem- 
plate; and 

(e) determining whether the user input sequence is a 
potential intrusion by examining output from a modeler 
using the frequency feature as one input to the modeler. 

2. A method as recited in claim 1 further comprising:

retrieving the most similar sequence template;

determining a first frequency of how often the user input
sequence occurs in a first command stream created by
a particular network user from a plurality of network
users;

determining a second frequency of how often the most
similar sequence template occurs in a second command
stream created by the plurality of network users; and

calculating the frequency feature using the first frequency
and the second frequency.

3. A method as recited in claim 2 further comprising 
calculating the second frequency using a smoothing coeffi- 
cient and a previous second frequency. 

4. A method as recited in claim 1 wherein retrieving a user 
input sequence further comprises: 

logging, in a chronological manner, commands and pro- 
gram names entered in the computer network thereby 
creating a command log; 

arranging the command log according to individual users 
on the computer network; and 

identifying the user input sequence from the command log 
using a predetermined time period. 

5. A method as recited in claim 1 wherein retrieving a 
sequence template from a plurality of sequence templates 
further comprises: 

logging chronologically commands and program names
entered in the computer network thereby creating a
command log;

identifying a command sequence from the command log
determined to be suspicious; and

creating the sequence template from the command
sequence.

6. A method as recited in claim 1 further comprising:

repeating steps (b) through (c) for each sequence template
in the plurality of sequence templates thereby deriving
a plurality of closeness factors; and

determining the most similar sequence template by
examining each closeness factor from the plurality of
closeness factors.

7. A method as recited in claim 1 wherein comparing the 
user input sequence and the sequence template to derive a 
closeness factor further comprises utilizing permutation 
matching to compare the user input sequence and one 
sequence template from the plurality of sequence templates. 

8. A method as recited in claim 1 wherein comparing the 
user input sequence and the sequence template to derive a 
closeness factor further comprises utilizing discrete
correlation matching to compare the user input sequence template
and one sequence template from the plurality of sequence
templates.

9. A method of determining similarity between a user 
sequence and a sequence template in a computer network 
intrusion detection system using correlation matching, the 
method comprising: 

(a) retrieving the user sequence including a plurality of 
user commands; 

(b) retrieving a template sequence including a plurality of 
template commands; 

(c) transforming one of the user sequence and the tem- 
plate sequence such that the user sequence and the 
template sequence are of substantially the same length; 

(d) performing a series of comparisons between the user 
sequence and the template sequence producing 
matches; 

(e) deriving a similarity factor from the number of 
matches between the plurality of user commands and 
the plurality of template commands; and 

(f) associating the similarity factor with said template 
sequence as an indication of likelihood of intrusion, 
whereby the complexity of the computer network intru- 
sion system is low. 

10. A method as recited in claim 9 wherein transforming 
one of the user sequence and the template sequence further 
comprises: 

determining which of the user sequence and the template
sequence is a shorter sequence; and

inserting one or more reserved characters at the end of the
shorter sequence.

11. A method as recited in claim 9 wherein deriving a 
similarity factor from the number of matches further com- 
prises shifting one of the plurality of user command ele- 
ments and the plurality of template command elements by 
one or more elements before performing each comparison of 
the series of comparisons between the user sequence and the 
template sequence. 

12. A method of determining similarity between a user 
sequence and a template sequence in a computer network 
intrusion system using permutation matching, the method 
comprising: 

retrieving the user sequence including a plurality of user 
commands; 

retrieving a template sequence including a plurality of 
stored commands; 

creating a user subset and a template subset, the user 
subset including user commands found in the template 
sequence and the template subset including stored 
commands found in the user sequence; and 

determining a number of alterations needed to reorder one 
of the user subset and the template subset to have the 
same order as one of the user subset and the template 
subset that was not reordered wherein the number of 
alterations is indicative of similarity between the user 
sequence and the template sequence, the similarity 
indicating a likelihood of intrusion, whereby the com- 
plexity of the computer network intrusion is low. 

13. A method as recited in claim 12 further comprising 
inverting adjacent user commands of the user subset until 
the order of the user commands is the same as the order of 
the stored commands of the template subset. 

14. A method as recited in claim 12 further comprising 
inverting adjacent stored commands of the template subset 
until the order of the stored commands is the same as the 
order of the user commands of the user subset. 

15. A method as recited in claim 12 further comprising 
normalizing the number of alterations by dividing the num- 
ber of alterations by a worst-case number of alterations 
wherein the worst-case number of alterations is the number 
of alterations needed to reorder one of the user subset and 
the template subset to have the same order as one of the user 
subset and the template subset that was not reordered when 
the user commands in the user subset and the stored com- 
mands in the template subset are in opposite order. 
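The permutation matching of claims 12 through 15 can be sketched as follows. The claims do not say how repeated commands are handled, so this sketch assumes each command is kept only at its first occurrence. The function name `permutation_distance` is hypothetical. Alterations are counted as inversions of adjacent elements, as in claims 13 and 14, and the worst case for n commands in opposite order is n(n-1)/2 such inversions.

```python
def permutation_distance(user, template):
    """Return (alterations, normalized_alterations) for two sequences.

    Sketch only: duplicates are dropped (first occurrence kept), which
    is an assumption not stated in the claims.
    """
    def unique(seq):
        return list(dict.fromkeys(seq))

    user_u, tmpl_u = unique(user), unique(template)
    user_set, tmpl_set = set(user_u), set(tmpl_u)
    # User commands found in the template, and stored commands found
    # in the user sequence, each in its original order.
    user_subset = [c for c in user_u if c in tmpl_set]
    tmpl_subset = [c for c in tmpl_u if c in user_set]

    # Express the user subset as positions in the template subset.
    rank = {c: i for i, c in enumerate(tmpl_subset)}
    order = [rank[c] for c in user_subset]

    # Count adjacent inversions needed to reorder (bubble sort swaps).
    swaps = 0
    for i in range(len(order)):
        for j in range(len(order) - 1 - i):
            if order[j] > order[j + 1]:
                order[j], order[j + 1] = order[j + 1], order[j]
                swaps += 1

    n = len(order)
    worst = n * (n - 1) // 2  # alterations needed for opposite order
    normalized = swaps / worst if worst else 0.0
    return swaps, normalized
```

A normalized value of 0 means the shared commands already appear in the same order, and a value of 1 means they appear in exactly opposite order.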

16. A system for detecting an intrusion in a computer 
network, the system comprising: 

an input sequence extractor for retrieving a user input 
sequence; 

a sequence template extractor for retrieving a sequence 
template from a plurality of sequence templates; 

a match component for comparing the user input sequence 
and the sequence template to derive a closeness factor 
indicating a degree of similarity between the user input 
sequence and the sequence template; 

a features builder for calculating a frequency feature 
associated with the user input sequence and a most 
similar sequence template; and 

a modeler using the frequency feature as one input to the 
modeler whereby it can be determined whether the user 
input sequence is a potential intrusion by examining 
output from the modeler. 

17. A system as recited in claim 16 wherein the user input 
extractor further comprises: 

a command log containing, in a chronological manner, 
commands and program names entered in the computer 
network and arranged according to individual users on 
the computer network; and 

a sequence identifier for identifying the user input 
sequence from the command log using a predetermined 
time period. 
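The command log and sequence identifier of claim 17 can be pictured with a short sketch. The log record layout of (timestamp, user, command), the use of seconds as the time unit, and the function name `user_sequences` are assumptions for illustration and are not taken from the claims.

```python
from collections import defaultdict


def user_sequences(log, start, period):
    """Group logged commands by user within one time window.

    Sketch only: `log` is an iterable of (timestamp, user, command)
    tuples, and the window is the half-open interval
    [start, start + period).
    """
    per_user = defaultdict(list)
    # Sort chronologically so each user's sequence is in time order.
    for ts, user, cmd in sorted(log, key=lambda entry: entry[0]):
        if start <= ts < start + period:
            per_user[user].append(cmd)
    return dict(per_user)
```

Each value in the returned mapping is one candidate user input sequence for the predetermined time period.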

18. A system as recited in claim 16 wherein the sequence 
template extractor further comprises: 

a command log containing, in a chronological manner, 
commands and program names entered in the computer 
network; 

a command sequence identifier for identifying a command 
sequence from the command log determined to be 
suspicious; and 

a sequence template extractor for creating the sequence 
template from the command sequence. 

19. A system as recited in claim 16 wherein the match 
component for comparing the user input sequence and the 
sequence template further comprises a permutation match- 
ing component for comparing the user input sequence and 
one sequence template from the plurality of sequence tem- 
plates. 

20. A system as recited in claim 16 wherein the match 
component for comparing the user input sequence and the 
sequence template further comprises a correlation matching 
component to compare the user input sequence template and 
one sequence template from the plurality of sequence tem- 
plates. 

21. A computer-readable medium containing programmed 
instructions arranged to detect an intrusion in a computer 
network, the computer-readable medium including pro- 
grammed instructions for: 

(a) retrieving a user input sequence; 

(b) retrieving a sequence template from a plurality of 
sequence templates; 

(c) comparing the user input sequence and the sequence 
template to derive a closeness factor indicating a degree 
of similarity between the user input sequence and the 
sequence template; 

(d) calculating a frequency feature associated with the 
user input sequence and a most similar sequence tem- 
plate; and 

(e) determining whether the user input sequence is a 
potential intrusion by examining output from a modeler 
using the frequency feature as one input to the modeler. 

22. A computer-readable medium as recited in claim 21 
further comprising programmed instructions for: 

retrieving the most similar sequence template; 

determining a first frequency of how often the user input 
sequence occurs in a first command stream created by 
a particular network user from a plurality of network 
users; 

determining a second frequency of how often the most 
similar sequence template occurs in a second command 
stream created by the plurality of network users; and 

calculating the frequency feature using the first frequency 
and the second frequency. 

23. A computer-readable medium as recited in claim 22 
further comprising programmed instructions for calculating 
the second frequency using a smoothing coefficient and a 
previous second frequency. 
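The frequency feature of claims 22 and 23 can be sketched under stated assumptions. The claims do not give the formulas, so the exponential form of the smoothing, the ratio used as the frequency feature, the small constant `eps`, and all function names below are hypothetical choices made only to show how the two frequencies might be combined.

```python
def count_occurrences(stream, sequence):
    """Count how often `sequence` occurs contiguously in `stream`."""
    stream, sequence = list(stream), list(sequence)
    n, m = len(stream), len(sequence)
    if m == 0 or m > n:
        return 0
    return sum(1 for i in range(n - m + 1) if stream[i:i + m] == sequence)


def smoothed_frequency(previous, observed, alpha):
    """Blend a new observation with the previous second frequency.

    Assumed exponential smoothing; `alpha` plays the role of the
    smoothing coefficient (0 <= alpha <= 1).
    """
    return alpha * observed + (1.0 - alpha) * previous


def frequency_feature(first_frequency, second_frequency, eps=1e-9):
    """Assumed feature: per-user frequency relative to all users."""
    return first_frequency / (second_frequency + eps)
```

With these choices, a large feature value would mean the particular user issues the sequence far more often than the most similar template appears across all users.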

24. A computer-readable medium as recited in claim 21 
wherein the programmed instructions for retrieving a user 
input sequence further comprises programmed instructions 
for: 

logging, in a chronological manner, commands and pro- 
gram names entered in the computer network to create 
a command log; 

arranging the command log according to individual users 
on the computer network; and 

identifying the user input sequence from the command log 
using a predetermined time period. 

25. A computer-readable medium as recited in claim 21 
wherein the programmed instructions for retrieving a 
sequence template from a plurality of sequence templates 
further comprises programmed instructions for: 

logging chronologically commands and program names 
entered in the computer network thereby creating a 
command log; 

identifying a command sequence from the command log 
determined to be suspicious; and 

creating the sequence template from the command 
sequence. 

26. A computer-readable medium as recited in claim 21 
further comprising programmed instructions for: 

repeating steps (b) through (c) for each sequence template 
in the plurality of sequence templates to derive a 
plurality of closeness factors; and 

determining the most similar sequence template by exam- 
ining each closeness factor from the plurality of close- 
ness factors. 
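The repetition over templates in claim 26 can be sketched as a simple loop. The closeness measure is passed in as a function. The stand-in `jaccard_closeness` used here is an assumption for illustration only and is not one of the matching techniques recited in the claims. The names `most_similar_template` and `jaccard_closeness` are hypothetical.

```python
def jaccard_closeness(user_sequence, template):
    """Stand-in closeness factor: shared commands over all commands."""
    a, b = set(user_sequence), set(template)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar_template(user_sequence, templates, closeness=jaccard_closeness):
    """Return (template, closeness_factor) for the closest template.

    Sketch only: derives one closeness factor per template, then
    picks the largest. Returns (None, 0.0) when there are no templates.
    """
    scored = [
        (closeness(user_sequence, template), index)
        for index, template in enumerate(templates)
    ]
    if not scored:
        return None, 0.0
    best_score, best_index = max(scored, key=lambda pair: pair[0])
    return templates[best_index], best_score
```

Any closeness function with the same two-argument form, such as a correlation or permutation measure, could be substituted for the stand-in.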

27. A computer-readable medium as recited in claim 21 
wherein the programmed instructions for comparing the user 
input sequence and the sequence template to derive a 
closeness factor further comprises programmed instructions 
for utilizing permutation matching to compare the user input 
sequence and one sequence template from the plurality of 
sequence templates. 

28. A computer-readable medium as recited in claim 21 
wherein programmed instructions for comparing the user 
input sequence and the sequence template to derive a 
closeness factor further comprises programmed instructions 
for utilizing discrete correlation matching to compare the 
user input sequence template and one sequence template 
from the plurality of sequence templates. 

29. A computer-readable medium containing programmed 
instructions arranged to determine similarity between a user 
sequence and a sequence template in a computer network 
intrusion detection system using correlation matching, the 
computer-readable medium including programmed instruc- 
tions for: 

(a) retrieving the user sequence including a plurality of 
user commands; 

(b) retrieving a template sequence including a plurality of 
template commands; 

(c) transforming one of the user sequence and the tem- 
plate sequence such that the user sequence and the 
template sequence are of substantially the same length; 

(d) performing a series of comparisons between the user 
sequence and the template sequence producing 
matches; 

(e) deriving a similarity factor from the number of 
matches between the plurality of user commands and 
the plurality of template commands; and 

(f) associating the similarity factor with said template 
sequence as an indication of likelihood of intrusion, 
whereby the complexity of the computer network intru- 
sion system is low. 

30. A computer-readable medium containing programmed 
instructions arranged to determine similarity between a user 
sequence and a template sequence in a computer network 
intrusion system using permutation matching, the computer- 
readable medium including programmed instructions for: 

retrieving the user sequence including a plurality of user 
commands; 

retrieving a template sequence including a plurality of 
stored commands; 

creating a user subset and a template subset, the user 
subset including user commands found in the template 
sequence and the template subset including stored 
commands found in the user sequence; and 

determining a number of alterations needed to reorder one 
of the user subset and the template subset to have the 
same order as one of the user subset and the template 
subset that was not reordered wherein the number of 
alterations is indicative of similarity between the user 
sequence and the template sequence, the similarity 
indicating a likelihood of intrusion, whereby the com- 
plexity of the computer network intrusion is low. 